Name: Syeduzzaman Khan

Project Title: A recipe for success as a movie producer

1. Objectives:

Movies are one of the main sources of entertainment of our time. Cinema is more than a century old, and from its beginning until today movies have attracted viewers all over the world. At present, almost every country has a movie industry. The project dataset contains movie industry data from 1986 to 2016.

The purpose of this project is to analyze over 30 years of movie industry data and explain the findings to Steven Spielberg, so that he can base his decision to invest in a film on the analyzed data.

To achieve Steven Spielberg's objective, we have to analyze the dataset carefully from different perspectives. At the beginning of the project, we will get familiar with the dataset and look into its columns. Then, we will segment the data logically based on interesting features that lead us toward our goal. The results will be visualized using static and dynamic plots.

2. Data Exploration:

Read Dataset:

In [1]:
# read csv file 
dataset <- read.csv ("movies.csv", na.strings="",stringsAsFactors=FALSE)

2.1 Print top 5 rows

In [2]:
head (dataset,n=5) # print top 5 rows
budget | company | country | director | genre | gross | name | rating | released | runtime | score | star | votes | writer | year
8000000 | Columbia Pictures Corporation | USA | Rob Reiner | Adventure | 52287414 | Stand by Me | R | 1986-08-22 | 89 | 8.1 | Wil Wheaton | 299174 | Stephen King | 1986
6000000 | Paramount Pictures | USA | John Hughes | Comedy | 70136369 | Ferris Bueller's Day Off | PG-13 | 1986-06-11 | 103 | 7.8 | Matthew Broderick | 264740 | John Hughes | 1986
15000000 | Paramount Pictures | USA | Tony Scott | Action | 179800601 | Top Gun | PG | 1986-05-16 | 110 | 6.9 | Tom Cruise | 236909 | Jim Cash | 1986
18500000 | Twentieth Century Fox Film Corporation | USA | James Cameron | Action | 85160248 | Aliens | R | 1986-07-18 | 137 | 8.4 | Sigourney Weaver | 540152 | James Cameron | 1986
9000000 | Walt Disney Pictures | USA | Randal Kleiser | Adventure | 18564613 | Flight of the Navigator | PG | 1986-08-01 | 90 | 6.9 | Joey Cramer | 36636 | Mark H. Baker | 1986

2.2 Number of rows and Number of columns:

In [3]:
cat ("Number of Rows: ",nrow(dataset)) # calculate number of rows
cat ("\nNumber of Columns: ",ncol(dataset)) # calculate number of columns
Number of Rows:  6820
Number of Columns:  15

The dataset appears credible and is moderately sized: 6820 rows and 15 columns.

2.3 Top Five Budget Films:

In [4]:
Budget <- as.numeric(dataset$budget) # store budget as numbers
Budget[is.na(Budget)] <- 0 # missing values set to zero
head(sort(Budget,decreasing=TRUE),5)
  1. 3e+08
  2. 2.6e+08
  3. 2.58e+08
  4. 2.5e+08
  5. 2.5e+08

The budgets of the top five films are very high, between 250 and 300 million. Films such as science fiction or war films typically need larger budgets.

2.4 Longest Runtimes [hrs]

In [5]:
Runtime <- as.numeric(dataset$runtime) # store runtime (in minutes) as numbers
Runtime[is.na(Runtime)] <- 0 # missing values set to zero
head(sort(Runtime/60,decreasing=TRUE),5)
  1. 6.1
  2. 5.95
  3. 4.66666666666667
  4. 4.51666666666667
  5. 4.03333333333333

The longest runtime is 6.1 hours, which is unusually long. It is hard to imagine many viewers staying patient for six hours.
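The missing-value handling used in the two cells above can be sketched on a toy vector. The vector `raw` below is made up for illustration and is not taken from movies.csv; the sketch assumes that missing entries arrive either as real NA values or as the string "NA".

```r
# Hypothetical input: a column read as text, with one missing entry.
raw <- c("8000000", "NA", "15000000")

# as.numeric() turns the string "NA" into a real NA,
# so is.na() can then find it reliably.
clean <- as.numeric(raw)
clean[is.na(clean)] <- 0   # treat missing values as zero

# two largest values, in decreasing order
top <- head(sort(clean, decreasing = TRUE), 2)
print(top)
```

Setting missing values to zero keeps the sort simple, but it would pull down any average computed later, so this choice is only reasonable for ranking the largest values.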

3. Logical Segmentation:

3.1 Most profitable movie genre -> gross - budget & film type

In [6]:
Gross <- as.numeric(dataset$gross) # store gross income as numbers
Gross[is.na(Gross)] <- 0 # missing values set to zero
profit <- head(sort(Gross-Budget,decreasing=TRUE),5) # five largest profits
film_genere <- character(0)
for (i in 1:length(profit))
{
    for (j in 1:nrow(dataset))
    {
        # if several films share the same profit, the last match is kept
        if (profit[i]==Gross[j]-Budget[j])
        {
            film_genere[i] <- dataset$genre[j]
        }
    }
}

Top profitable film Genre:

In [7]:
head(film_genere,5)
  1. 'Action'
  2. 'Action'
  3. 'Action'
  4. 'Drama'
  5. 'Action'

Four of the five most profitable films are Action films, which makes Action the most promising genre for future investment.
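The same lookup can also be written without nested loops. The sketch below uses order() on a small made-up data frame; the name `toy` and its values are hypothetical, and only the column names mirror movies.csv.

```r
# Hypothetical toy data; the column names mirror movies.csv.
toy <- data.frame(
  name   = c("A", "B", "C", "D"),
  genre  = c("Action", "Drama", "Comedy", "Action"),
  budget = c(100, 20, 30, 50),
  gross  = c(400, 90, 35, 300),
  stringsAsFactors = FALSE
)

toy$profit <- toy$gross - toy$budget                       # profit per film
top_idx <- head(order(toy$profit, decreasing = TRUE), 3)   # rows of the top 3 profits
top_genres <- toy$genre[top_idx]                           # genres of those rows
print(toy[top_idx, c("name", "genre", "profit")])
```

Unlike the nested loop, order() returns one row per rank, so two films with the same profit would both be listed instead of the later one replacing the earlier one.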

3.2 Highest User Rating and company

In [8]:
Rating <- as.numeric(dataset$score) # store score as numbers
Rating[is.na(Rating)] <- 0 # missing values set to zero
score_array <- head(sort(Rating,decreasing=TRUE),5) # five highest scores
company_h_rating <- character(0)
for (i in 1:length(score_array))
{
    for (j in 1:nrow(dataset))
    {
        # if several films share the same score, the last match is kept
        if (score_array[i]==Rating[j])
        {
            company_h_rating[i] <- dataset$company[j]
        }
    }
}

Highest user rating Film Companies:

In [9]:
head(company_h_rating,5)
  1. 'Castle Rock Entertainment'
  2. 'Warner Bros.'
  3. 'New Line Cinema'
  4. 'New Line Cinema'
  5. 'New Line Cinema'

The companies above produced the highest-rated films in the dataset over the period 1986 to 2016.
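Because many films share the same score, a complementary view is the average score per company. The sketch below shows this with aggregate() on made-up data; `toy` and the studio names are hypothetical and the numbers are not results from movies.csv.

```r
# Hypothetical toy data; the column names mirror movies.csv.
toy <- data.frame(
  company = c("Studio X", "Studio X", "Studio Y", "Studio Z"),
  score   = c(9.0, 7.0, 8.5, 6.0),
  stringsAsFactors = FALSE
)

# mean score per company, best company first
avg <- aggregate(score ~ company, data = toy, FUN = mean)
avg <- avg[order(avg$score, decreasing = TRUE), ]
print(avg)
```

An average rewards companies that are consistently good, while the top-five lookup rewards a single outstanding film, so the two rankings can differ.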

3.3 Top voted Actor

In [10]:
# top votes
votes <- as.numeric(dataset$votes) # store votes as numbers
votes[is.na(votes)] <- 0 # missing values set to zero
votes_array <- head(sort(votes,decreasing=TRUE),5) # five highest vote counts
a <- character(0)
for (i in 1:length(votes_array))
{
    for (j in 1:nrow(dataset))
    {
        if (votes_array[i]==votes[j])
        {
            a[i] <- dataset$star[j] # star of the matching film
        }
    }
}
head(a,5)
  1. 'Tim Robbins'
  2. 'Christian Bale'
  3. 'Leonardo DiCaprio'
  4. 'Brad Pitt'
  5. 'John Travolta'

The actors above starred in the five films that received the most user votes.

3.4 Top voted Writer

In [11]:
votes <- as.numeric(dataset[dataset$year>2010,c("votes")]) # votes of films released after 2010
votes[is.na(votes)] <- 0 # missing values set to zero
votes_array <- head(sort(votes,decreasing=TRUE),5) # five highest vote counts
a <- character(0)
for (i in 1:length(votes_array))
{
    for (j in 1:nrow(dataset))
    {
        # match only films released after 2010
        if (dataset$year[j]>2010 && votes_array[i]==dataset$votes[j])
        {
            a[i] <- dataset$writer[j] # writer of the matching film
        }
    }
}

Top voted Writers:

In [12]:
head(a,5)
  1. 'Jonathan Nolan'
  2. 'Jonathan Nolan'
  3. 'Quentin Tarantino'
  4. 'Joss Whedon'
  5. 'Terence Winter'

The list above shows the writers of the five most-voted films released after 2010.
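Filtering by year before ranking keeps the lookup inside the period of interest. The sketch below shows this on a small made-up data frame; `toy` and the writer names W1 to W4 are hypothetical.

```r
# Hypothetical toy data; the column names mirror movies.csv.
toy <- data.frame(
  writer = c("W1", "W2", "W3", "W4"),
  year   = c(2008, 2012, 2014, 2011),
  votes  = c(900, 500, 700, 600),
  stringsAsFactors = FALSE
)

recent <- toy[toy$year > 2010, ]   # keep only films released after 2010
top_writers <- recent$writer[head(order(recent$votes, decreasing = TRUE), 2)]
print(top_writers)
```

W1 has the most votes overall but is excluded because the film is from 2008, which is the behaviour the "after 2010" restriction is meant to give.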

3.5 Top voted Director

In [13]:
votes <- as.numeric(dataset[dataset$year>2010,c("votes")]) # votes of films released after 2010
votes[is.na(votes)] <- 0 # missing values set to zero
votes_array <- head(sort(votes,decreasing=TRUE),5) # five highest vote counts
a <- character(0)
for (i in 1:length(votes_array))
{
    for (j in 1:nrow(dataset))
    {
        # match only films released after 2010
        if (dataset$year[j]>2010 && votes_array[i]==dataset$votes[j])
        {
            a[i] <- dataset$director[j] # director of the matching film
        }
    }
}

Top popular Directors:

In [14]:
head(a,5)
  1. 'Christopher Nolan'
  2. 'Christopher Nolan'
  3. 'Quentin Tarantino'
  4. 'Joss Whedon'
  5. 'Martin Scorsese'

The list above shows the directors of the five most-voted films released after 2010.

The dataset has been segmented logically by the following features:

  1. Film Genre

  2. Production Company

  3. Popular Actor

  4. Popular Writer

  5. Director

When making a movie, these five elements are the main choices to consider: we need to select a film genre, production company, actor, writer, and director. With the right combination, it is possible to make a popular movie.

4. Static Visualization:

4.1 Pie chart -> gross income by movie genre

In [15]:
gross <- dataset$gross # gross income of all films
gross_action <- dataset[dataset$genre=="Action",c("gross")] # gross income of Action films
gross_Adventure <- dataset[dataset$genre=="Adventure",c("gross")] # gross income of Adventure films
gross_Comedy <- dataset[dataset$genre=="Comedy",c("gross")] # gross income of Comedy films
gross_Drama <- dataset[dataset$genre=="Drama",c("gross")] # gross income of Drama films
others <- sum(gross)-(sum(gross_action)+sum(gross_Adventure)+sum(gross_Comedy)+sum(gross_Drama)) # all remaining genres
x <- c(sum(gross_action),sum(gross_Adventure),sum(gross_Comedy),sum(gross_Drama),others)
labels <- c("Action","Adventure","Comedy","Drama","Others")
piepercent <- round(100*x/sum(x), 1) # share of total gross income in percent

pie(x, labels = piepercent, main = "Gross Income",col = rainbow(length(x)))
legend("topright", labels, cex = 0.8,
   fill = rainbow(length(x)))
x1<-x

The pie chart shows the share of gross income by genre. Action holds the first position with about 33% of gross income, followed by Comedy, Drama, and Adventure in second, third, and fourth position. The remaining genres together account for about 23.6% of gross income.
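The per-genre shares can also be computed for every genre at once with tapply(), without one variable per genre. The sketch below uses made-up numbers; `toy` is hypothetical and the resulting percentages are not the ones from movies.csv.

```r
# Hypothetical toy data; the column names mirror movies.csv.
toy <- data.frame(
  genre = c("Action", "Action", "Comedy", "Drama", "Horror"),
  gross = c(300, 200, 250, 150, 100),
  stringsAsFactors = FALSE
)

totals <- tapply(toy$gross, toy$genre, sum)        # total gross per genre
share  <- round(100 * totals / sum(totals), 1)     # percentage of overall gross
share  <- share[order(share, decreasing = TRUE)]   # largest share first
print(share)
```

The result is a named vector, so names(share) can be passed as labels to pie() and the values as the slices.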

4.2 Scatter chart-> Score vs Runtime

In [16]:
# scatter plot 
plot(x = dataset$score,y = dataset$runtime,
   xlab = "Score",
   ylab = "Runtime",
   xlim = c(2,10),
   ylim = c(50,400),		 
   main = "Score vs Runtime"
)

The scatter plot shows score against runtime. In this plot, films with a runtime of about 120 min appear to score about 7.5 on average.

4.3 Bar Chart: Avg Budget vs Decade

In [26]:
decade1 <- mean(dataset[dataset$year<=1995,c("budget")]) # 1986-1995
decade2 <- mean(dataset[dataset$year>=1996 & dataset$year<=2005,c("budget")]) # 1996-2005 only
decade3 <- mean(dataset[dataset$year>=2006,c("budget")]) # 2006-2016 only
v<-c(decade1,decade2,decade3)
v<-v/1000000
M <- c("1986-1995","1996-2005","2006-2016")
barplot(v,names.arg=M,ylim = c(0,40),xlab="Decade",ylab="Budget [Million]",
main="Budget")

The bar chart shows the average budget per decade. The average budget has increased over time.
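Grouping years into decades can be sketched with cut(), which assigns each year to exactly one interval. The data frame `toy` and its values below are made up for illustration.

```r
# Hypothetical toy data; budget in million.
toy <- data.frame(
  year   = c(1990, 1994, 2000, 2004, 2010, 2015),
  budget = c(10, 20, 30, 50, 60, 80)
)

# intervals are (1985,1995], (1995,2005], (2005,2016]
toy$decade <- cut(toy$year,
                  breaks = c(1985, 1995, 2005, 2016),
                  labels = c("1986-1995", "1996-2005", "2006-2016"))

avg_budget <- tapply(toy$budget, toy$decade, mean)   # mean budget per decade
print(avg_budget)
```

Because every year falls into a single interval, no film is counted in more than one decade, and the names of `avg_budget` can be used directly as bar labels.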

4.4 Box Plot: Actor vs Gross

In [27]:
votes <- as.numeric(dataset$votes) # store votes as numbers
votes[is.na(votes)] <- 0 # missing values set to zero
votes_array <- head(sort(votes,decreasing=TRUE),5) # five highest vote counts
# stars of the five most-voted films
a <- character(0)
for (i in 1:length(votes_array))
{
    for (j in 1:nrow(dataset))
    {
        if (votes_array[i]==votes[j])
        {
            a[i] <- dataset$star[j]
        }
    }
}
# gross income of all Christian Bale films, plotted in million USD
dataset1 <- dataset[dataset$star=="Christian Bale",c("star","gross")]
boxplot((dataset1$gross)/1000000, xlab = "Christian Bale",ylab = "Film Gross Income [Million USD]", main = "Gross Income of top Star films",names=c("Christian Bale"),ylim=c(0,220))

The box-and-whisker plot above shows the gross income of Christian Bale's films. The highest gross income among his films is about 210 million USD.

4.5 qplot : Vote vs Score

In [28]:
library(ggplot2) # provides qplot
votes <- dataset$votes
score <- dataset$score
qplot(score,votes,data=dataset,geom=c("point","line"),color=I("red")) # I() draws in red instead of mapping a legend entry

The qplot above shows the number of votes against the score of each film. The number of votes tends to increase with the score.

4.6 Histogram of Movie Runtime

In [29]:
hist(dataset$runtime, 
     main="Histogram for runtime", 
     xlab="Runtime", 
     border="blue", 
     col="green",
     xlim=c(0,500),
     ylim=c(0,4000),
     las=1, 
     breaks=5)

The histogram shows the distribution of movie runtimes. Most films have a runtime between 100 and 150 min.

4.7 Plot function: Votes vs Gross Income

In [30]:
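# Plot votes against gross income (in million USD);
# the columns are used directly so that v from 4.3 is not overwritten
plot(dataset$votes,dataset$gross/1000000,col="red", lwd=5, xlab="votes", ylab="gross [Million USD]", main="Votes vs Gross income")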

The plot above shows votes against gross income. Films with more votes tend to have a higher gross income.
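How strongly two columns move together can be checked with a correlation coefficient in addition to the plot. The sketch below applies cor() to two short made-up vectors; the values are hypothetical and say nothing about the real dataset.

```r
# Hypothetical values for five films.
votes <- c(100, 200, 300, 400, 500)
gross <- c(12, 18, 33, 41, 48)

r   <- cor(votes, gross)                       # Pearson: linear relationship
rho <- cor(votes, gross, method = "spearman")  # Spearman: monotonic relationship
cat("Pearson:", round(r, 3), " Spearman:", round(rho, 3), "\n")
```

A value near 1 supports a statement such as "more votes go with higher gross income", while a value near 0 would not, even if the scatter plot looks dense.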

4.8 Density function of score

In [31]:
plot(density(dataset$score))

The density plot above shows the distribution of film scores. The density peaks at a score of about 6.8, where it reaches a value of about 0.4.
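The position and height of the peak can be read from the density object instead of from the graph. The sketch below does this for simulated scores; the vector `scores` is randomly generated for illustration and is not the score column of movies.csv.

```r
set.seed(1)                                # make the simulated data reproducible
scores <- rnorm(500, mean = 6.5, sd = 1)   # hypothetical scores

d <- density(scores)                  # kernel density estimate
peak_score   <- d$x[which.max(d$y)]   # score at which the density is highest
peak_density <- max(d$y)              # height of the peak
cat("peak at score", round(peak_score, 2),
    "with density", round(peak_density, 2), "\n")
```

Applied to dataset$score, the same two lines would give the peak of the plot above as numbers.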

4.9 3D Scatter plot

In [32]:
if (!require("scatterplot3d")) install.packages("scatterplot3d")
library(scatterplot3d)
scatterplot3d(dataset$budget,dataset$score,dataset$votes, color=as.integer(dataset$votes)) # axes: budget, score, votes
Loading required package: scatterplot3d

The 3D scatter plot shows the budget, score, and votes of each film. Most low-budget films have low scores and few votes.

5. Dynamic Plot

In [23]:
library(plotly)
Loading required package: ggplot2

Attaching package: ‘plotly’

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:stats’:

    filter

The following object is masked from ‘package:graphics’:

    layout

5.1 3D plot dynamic plot of score , votes, and budget

In [24]:
data<-dataset[c("score","votes","budget")]

with(data, plot_ly(data, x = score, y= votes, z = budget,
                  size = votes,
                  type="scatter3d", mode="markers"))
Warning message:
“`line.width` does not currently support multiple values.”

The above 3D dynamic plot is for score, votes, and budget.

5.2 Dynamic Bar chart

In [25]:
plot_ly(x = M,y = v,name = "Budget over Decades",type = "bar")

The dynamic bar chart is drawn using the plot_ly function of the plotly library. It shows the same data as the static bar chart in 4.3: the average budget of a movie has increased over the decades.

5.3 Dynamic Pie/donut Chart: Gross income by film genre

In [33]:
x2 <- c("Action","Adventure","Comedy","Drama","Others")
plot_ly(labels = x2, values = piepercent) %>%
  add_pie(hole = 0.6) %>%
  layout(title = "Gross Income by Genre",  showlegend = F,
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

The donut chart shows the share of gross income by movie genre. Action holds the first position with about 33% of gross income, followed by Comedy, Drama, and Adventure in second, third, and fourth position. The remaining genres together account for about 23.6% of gross income.

5.4 Bubble chart-> score vs runtime

In [34]:
data3<-dataset[c("score","runtime")]
plot_ly(data3, x = ~score, y = ~runtime, type = 'scatter', mode = 'markers',
        marker = list( opacity = 0.5)) %>%
  layout(title = 'Score vs Runtime',
         xaxis = list(showgrid = FALSE),
         yaxis = list(showgrid = FALSE))

The chart above shows score against runtime. Longer films tend to receive somewhat higher scores.

5.5 Dynamic Horizontal Box plot

In [35]:
# use the data columns directly; rnorm() would only draw random numbers
plot_ly(x = ~dataset$gross, type = "box") %>%
  add_trace(x = ~dataset$score)

The above plot represents the gross income (trace 0) and score (trace 1).
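The values a box plot is built from can be checked numerically with fivenum(), which returns the minimum, lower hinge, median, upper hinge, and maximum. The two vectors below are made up for illustration and are not taken from movies.csv.

```r
# Hypothetical values for seven films.
gross <- c(5, 12, 20, 35, 60, 150, 400)         # in million USD
score <- c(4.5, 5.8, 6.2, 6.4, 6.9, 7.3, 8.1)

gross_summary <- fivenum(gross)   # min, lower hinge, median, upper hinge, max
score_summary <- fivenum(score)
print(gross_summary)
print(score_summary)
```

The two summaries live on very different scales, which is worth keeping in mind when both variables share one axis in a single box plot.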

6. Summary

The dataset provides good insight into movie history from 1986 to 2016. It appears reliable and is moderately sized, so for a small project it offers a moderate chance of finding a near-optimal answer. The dataset has been segmented by genre, production company, actor, writer, and director. Taken together, these characteristics make it possible to plan a popular movie.

Action is the highest-grossing genre. The production companies behind the highest-rated films are Castle Rock Entertainment, Warner Bros., and New Line Cinema. The stars of the most-voted films are Tim Robbins, Christian Bale, Leonardo DiCaprio, Brad Pitt, and John Travolta.

The screenplay plays a vital role in a movie's success. The writers of the most-voted films after 2010 are Jonathan Nolan, Quentin Tarantino, Joss Whedon, and Terence Winter. Without good direction, a film is unlikely to be watchable, so selecting the right director is important. The directors of the most-voted films after 2010 are Christopher Nolan, Quentin Tarantino, Joss Whedon, and Martin Scorsese.

7. Recommendations

To produce a blockbuster film, Steven Spielberg can consider the following suggestions:
1. Genre: Action
2. Production company: Warner Bros
3. Actor: Christian Bale
4. Director: Christopher Nolan
5. Writer: Jonathan Nolan
6. Runtime: 110 to 140 min
